Data partition handoff between storage clusters

ABSTRACT

One example provides a method of migrating a data partition from a first storage cluster to a second storage cluster, the method including determining that the data partition meets a migration criteria for migrating from the first storage cluster to the second storage cluster, on the first storage cluster, preparing partition metadata to be transferred, the partition metadata describing one or more streams within the data partition and one or more extents within each stream, transferring the partition metadata from the first storage cluster to the second storage cluster, directing new transactions associated with the data partition to the second storage cluster, including while the one or more extents reside at the first storage cluster, on the first storage cluster, changing an access attribute of the one or more extents within the data partition to read-only, and on the second storage cluster, performing new ingress for the data partition.

BACKGROUND

Customers of a distributed storage system may use a storage account to store their data in the distributed storage system. A particular geographical region of the distributed storage system may include one or more data center buildings, and each data center building may include multiple storage clusters. A storage cluster is a collection of servers (nodes) running common distributed software, e.g. a collection of software services. Each storage cluster serves plural (e.g., several hundred to several thousand) storage accounts and associated transactions, which utilize central processing unit (CPU) resources on each of the nodes.

A distributed storage system may migrate a storage account from one storage cluster to another storage cluster for various reasons, such as to alleviate capacity pressure in the storage cluster, to balance CPU and input/output operations per second (IOPS) among storage clusters within a region, and/or to decommission a storage cluster. Further, if a live site is running on the storage cluster, the distributed storage system may migrate some impacted storage accounts to another storage cluster.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

Examples are disclosed that relate to data migration in a distributed computing system. One example provides, enacted on a computing system, a method of migrating a data partition from a first storage cluster to a second storage cluster. Each storage cluster may be implemented via one or more server computers. The method comprises determining that the data partition meets a migration criteria for migrating from the first storage cluster to the second storage cluster; on the first storage cluster, preparing partition metadata to be transferred, the partition metadata describing one or more streams within the data partition and one or more extents within each stream; transferring the partition metadata from the first storage cluster to the second storage cluster; directing new transactions associated with the data partition to the second storage cluster, including while the one or more extents reside at the first storage cluster; on the first storage cluster, changing an access attribute of the one or more extents within the data partition to read-only; and on the second storage cluster, performing new ingress for the data partition.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 schematically shows an example computing environment for implementing a distributed storage system.

FIG. 2 schematically shows a state of a source storage cluster and a destination storage cluster prior to a partition handoff.

FIG. 3 schematically shows aspects of a preparation phase of a partition handoff.

FIG. 4 schematically shows aspects of a handoff phase of the partition handoff.

FIG. 5 schematically shows aspects of resource balancing among storage cluster groups.

FIG. 6 schematically shows aspects of parallel migrations to multiple destination storage clusters.

FIG. 7 shows a flowchart illustrating an example method of migrating a data partition from a first storage cluster to a second storage cluster.

FIG. 8 shows a block diagram illustrating an example computing system.

DETAILED DESCRIPTION

FIG. 1 shows an example computing environment 100 that includes a distributed storage system 102. Data is stored in storage stamps 104 and 106, where each storage stamp (also referred to herein as a “storage cluster”) is a cluster of N racks of storage nodes, and where each rack is built out as a separate fault domain with redundant networking and power.

The distributed storage system 102 implements a global storage namespace that allows data to be stored in a consistent manner. In some examples, the storage namespace utilizes the Domain Name System (DNS) 108 and may include two parts: an account name (customer selected name and part of the DNS host name), and an object name (identifies individual objects within the account). Accordingly, all data is accessible via a Uniform Resource Identifier (URI) that includes the account name and the object name if individual objects are being accessed.
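
As a purely illustrative, hypothetical example (the exact host and path format is not prescribed by this disclosure), such a URI may take a form such as:

    https://<account-name>.<storage-endpoint>/<object-name>

where the account name forms part of the DNS host name and the object name identifies the individual object within the account.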

A storage location service 110 is configured to manage all storage stamps, and is responsible for managing the account namespace across all storage stamps. To increase storage, more storage stamps may be deployed in a data center and added to the storage location service 110, and the storage location service 110 may allocate new accounts to the new storage stamps while also load balancing existing storage accounts from old stamps to new stamps. The storage location service 110 tracks resources used by each storage stamp across all locations, and when an application requests a new account for storing data, the storage location service 110 specifies the location affinity for the storage, and chooses a storage stamp within that location as the primary stamp for the account based on load information across all stamps. The storage location service 110 then stores the account metadata information in the chosen storage stamp, instructing the stamp to start receiving traffic for the account. The storage location service 110 also updates DNS 108 to allow requests to route from a client computing device 112 to that storage stamp's virtual IP (VIP), as shown at 114 and 116 respectively for storage stamps 104 and 106.

Each storage stamp has three layers, which from the bottom up are: (1) a stream layer, at 118 and 120, that stores bits on disk and is configured to distribute and replicate data across servers within a storage stamp; (2) a partition layer, at 122 and 124, configured to manage higher level data abstractions (e.g. blobs, tables, queues), provide a scalable object namespace, store object data on top of the stream layer, provide transaction ordering and consistency for objects, send transactions to other storage stamps, and cache object data to reduce disk I/O; and (3) a front-end layer, at 126 and 128, that receives incoming requests, authenticates and authorizes the requests, and routes the requests to a partition server in the partition layer.

Intra-stamp replication may be used within the stream layer, which is synchronous replication that keeps enough replicas of the data across different nodes in different fault domains to keep data durable within the storage stamp. Intra-stamp replication replicates blocks of disk storage that are used to make up objects. Further, inter-stamp replication may be used within the partition layer, which is asynchronous replication that replicates data across storage stamps. Inter-stamp replication replicates objects and transactions applied to those objects. Intra-stamp replication provides durability against hardware failures, whereas inter-stamp replication provides geo-redundancy against geo-disasters.

The front-end layer 126, 128 of a storage stamp 104, 106 includes stateless servers that receive incoming requests (e.g. from client computing device 112). Upon receiving a request, a front-end server looks up an account name associated with the request, authenticates and authorizes the request, and routes the request to a partition server in the partition layer 122, 124 (e.g. based on a partition name). The distributed computing system 102 maintains a partition map of partition name ranges and which partition server is serving which partition name. The front-end servers may cache the partition map and use the partition map to determine to which partition server to forward each request. The front-end servers also may stream large objects directly from the stream layer and cache frequently accessed data.
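
For purposes of illustration only, the following sketch (written in Python, with hypothetical names that are not part of this disclosure) shows how a front-end server might use a cached, sorted partition map to select the partition server for a request key. It is a minimal sketch of the lookup described above, not a definitive implementation.

    from bisect import bisect_right
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class PartitionMapEntry:
        key_low: str           # inclusive low end of the partition name range
        key_high: str          # exclusive high end of the partition name range
        partition_server: str  # partition (table) server serving this range

    class CachedPartitionMap:
        """In-memory partition map as cached by a front-end server."""

        def __init__(self, entries: List[PartitionMapEntry]):
            # Keep entries sorted by key_low so a request key can be located
            # with a binary search.
            self.entries = sorted(entries, key=lambda e: e.key_low)
            self._lows = [e.key_low for e in self.entries]

        def route(self, partition_name: str) -> str:
            """Return the partition server to which the request should be forwarded."""
            i = bisect_right(self._lows, partition_name) - 1
            if i < 0 or not (self.entries[i].key_low <= partition_name < self.entries[i].key_high):
                raise LookupError("no cached entry covers this key; refresh the map")
            return self.entries[i].partition_server

For example, with entries covering ranges ("a", "m") on TS-48 and ("m", "~") on TS-156, route("kitten") returns TS-48.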

The stream layer 118, 120 acts as a distributed file system within a storage stamp 104, 106. The stream layer understands files called “streams”, which are ordered lists of pointers to extents. Extents are logical blocks of data stored at some physical location on disk. When the extents are concatenated together, the extents represent the full contiguous address space in which the stream can be read in the order the extents were added to the stream. A new stream can be constructed by concatenating extents from existing streams. Extents are units of replication in the stream layer. Each of a plurality of data storage nodes in the stream layer maintains storage for a set of extent replicas (e.g. three replicas within a storage stamp for each extent).

While data is stored in the stream layer, it is accessible from the partition layer 122, 124. The partition layer 122, 124 maintains a highly scalable table that includes object-related metadata (object name, storage account that stores the object, etc.), which together forms a primary key. The table—also referred to herein as a partition index—includes pointers to corresponding data blocks on the disks. In this manner, the partition layer keeps track of the streams, extents, and byte offsets in the extents in which objects are stored. While the partition layer may only know the logical position of each extent, the stream layer maintains a mapping of where each extent is physically stored. The table includes millions of objects that cannot be served by a single server, so the table is broken into units called “partitions”. The partition layer partitions all data objects within a storage stamp 104, 106 and provides data structures (e.g. blobs, queues, tables) for supporting different cloud services. Partition servers (daemon processes in the partition layer) and stream servers may be co-located on each storage node in a storage stamp.

A partition is a collection of multiple streams, and each stream is a container of multiple extents. Each stream within a partition may serve a different purpose. For example, the table may be stored as extents in one of the streams (e.g. an index stream), whereas data extents may be stored in a different stream (e.g. a data stream), and other streams (e.g. a metadata stream) may store other extents. Each partition serves a set of objects in a key range ranging from a KeyLow to a KeyHigh, where each object comprises a partition name. Objects are broken down into disjoint ranges based on the partition name values and served by different partition servers of the partition layer. Thus, the partition layer manages which partition server is serving which partition name ranges for blobs, tables, and queues. Each partition constitutes the table itself, the key range where the objects are stored, and the data blocks themselves (stored in streams). A table server (TS) serves the partition. Further, the partition layer provides automatic load balancing of partition names across the partition servers based upon traffic.
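
The relationship among partitions, streams, and extents described above may be sketched with the following illustrative Python data structures (hypothetical names; a minimal sketch rather than an actual on-disk format):

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Extent:
        extent_id: str
        length: int           # bytes currently written to the extent
        sealed: bool = False  # sealed extents are immutable

    @dataclass
    class Stream:
        name: str
        # A stream is an ordered list of pointers to extents; concatenated in
        # order, the extents form the stream's contiguous address space.
        extents: List[Extent] = field(default_factory=list)

    @dataclass
    class Partition:
        key_low: str
        key_high: str
        # Each stream within the partition may serve a different purpose, e.g.
        # an index stream holding the partition index, a data stream holding
        # data extents, and a metadata stream holding other extents.
        index_stream: Stream = field(default_factory=lambda: Stream("index"))
        data_stream: Stream = field(default_factory=lambda: Stream("data"))
        metadata_stream: Stream = field(default_factory=lambda: Stream("metadata"))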

As mentioned above, users (e.g. businesses, individuals, or other entities) of the distributed storage system 102 may use a storage account to store data in the distributed storage system 102. A storage account functions as a container where the customer stores a collection of objects associated with the storage account. A customer may utilize any number of storage accounts, and each storage account may include limits on transactions per second (TPS), bandwidth for receiving/sending data, etc.

At times, one or more storage accounts on a storage cluster may be performing a heavy workload. In such instances, the distributed computing system 102 may perform load balancing by migrating one or more select storage accounts to a different storage cluster. The distributed computing system may migrate a storage account from a source storage cluster to a destination storage cluster for various reasons, including but not limited to balancing CPU resources, IOPS, and/or TPS among storage clusters, alleviating capacity pressure in a source storage cluster, decommissioning a storage cluster, and/or reducing impact of live sites running on a storage cluster.

Currently, account migration may involve deep copying of all objects (blobs, disks, tables, queues, etc.) and the underlying data in the storage account from the source storage cluster to another storage cluster within the same geographical region. This deep copying involves iterating through each object in a partition (which may utilize parallelism), reading the objects and the underlying data, dispatching the data to the destination storage cluster, and verifying the objects and underlying data between the source and destination storage clusters (e.g., to ensure that the copy is correct, there are no software bugs in the copy process, etc.). The data verification may involve computing a hash on both the source storage cluster and the destination storage cluster and verifying that the computed hashes match. Once the objects and data are verified, customer traffic is reopened at the destination storage cluster instead of at the source storage cluster.
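
As a simplified illustration of the per-object hash comparison used by such deep-copy verification (assuming, for the sketch only, a SHA-256 digest; the actual hash function is not specified above), the check may look like the following:

    import hashlib
    from typing import Callable, Iterable

    def object_digest(chunks: Iterable[bytes]) -> str:
        """Compute a digest over an object's data, streamed in chunks."""
        h = hashlib.sha256()
        for chunk in chunks:
            h.update(chunk)
        return h.hexdigest()

    def verify_copied_object(read_source: Callable[[], Iterable[bytes]],
                             read_destination: Callable[[], Iterable[bytes]]) -> bool:
        """Compare digests computed independently on the source and destination
        storage clusters; a mismatch indicates a copy error or a software bug."""
        return object_digest(read_source()) == object_digest(read_destination())

Note that this comparison must be repeated for every object in the account, which is what makes the deep-copy approach I/O- and CPU-intensive at scale.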

A storage account may include millions of objects, including objects of different types (e.g., blobs, disks, tables, queues, etc.). Moving all the objects within a storage account, and moving at scale, poses various challenges. For example, iterating through each object in a partition is a CPU-intensive process. Further, the above-described account migration process requires a disk I/O per object to be migrated, regardless of object size. The millions of I/Os required to migrate millions of objects utilize a considerable amount of CPU resources and bandwidth. When a select storage cluster is running hot on CPU and/or bandwidth usage, the distributed computing system 102 may be unable to schedule enough object migrations to stabilize the storage cluster. The distributed computing system 102 also may be unable to complete some critical migrations, such as those related to a live site on a storage cluster, fast enough to prevent impact to other storage accounts on the storage cluster. Further, such load balancing operations interrupt service to a storage account being migrated.

In addition to the above problems, current account migration approaches may not be feasible for certain storage accounts. As one example, current account migration processes may be too time-consuming and computing resource-intensive for a storage account with a large object count (large number of blobs, table entity count, queue message count, etc.), due to the disk I/O required to move each object. As another example, current account migration processes involve copying and migrating data in layers on the order of 1 kilobyte (KB) in size. Thus, a storage account with a large size (e.g., several petabytes (PB)) may take an unreasonably long time, such as months, to migrate. Further, for storage accounts with high TPS, the rate of data transfer during such a migration process may not be sufficient to keep up with ongoing transactions. Similarly, for high ingress accounts, if a rate of transfer of objects during account migration is slower than an incoming rate from account users, then the account migration may not reach completion. For example, once a deep copy of existing data is complete, the distributed storage system operates in a “catch-up” mode to transfer recently received data on the source storage cluster to the destination storage cluster, which may not be possible if ingress is too high. As yet another example, a storage account may exceed a threshold number of blob snapshots (a point-in-time copy of an object), or snapshot creation frequency may outpace an account migration process, and a subsequent migration verification process may lag behind user storage account ingress. Further, for any of the above reasons, the aforementioned account migration processes may not be feasible for premium service-level accounts to which service cannot be interrupted.

Accordingly, examples are disclosed herein that relate to migration operations in which partition metadata transfer is decoupled from the data blocks (extents) themselves. Rather than performing a deep copy of the data within a storage account and moving the data between storage clusters, the disclosed migration operations involve handing off a partition (all objects within a key range) or a group of partitions concurrently from a source storage cluster to a destination storage cluster. Because the disclosed partition handoff involves transferring ownership of extents (partition index and data) from the source storage cluster to the destination storage cluster while the extents themselves may still reside at the source storage cluster, the handoff may be faster than current data copy migration operations. By transferring extent ownership, subsequent incoming write requests land on the destination storage cluster, which provides relief to resources of the source storage cluster that were previously occupied with incoming traffic. This also avoids scenarios in which ingress is too high to facilitate account migration, since ingress points at the destination storage cluster. Further, the migration operations described herein may occur as background processes without interrupting storage account service, thereby appearing transparent to a user(s) of the storage account.

Briefly, a partition handoff from a source storage cluster to a destination storage cluster includes a preparation phase and a handoff phase. During the preparation phase, the source storage cluster transfers extent metadata to the destination storage cluster. As this extent metadata transfer is decoupled from ingress, high ingress does not impact the metadata size. The preparation phase, which does not impact user traffic to a storage account, helps to quicken the handoff phase, during which user traffic may be regulated.

In some examples, the partition handoff process is orchestrated by a table master (TM) of the source storage cluster. A TM manages multiple table servers (TSs), determines which table server hosts which partition, and updates the partitions table. In other examples, an external driver in communication with both the TM of the source storage cluster (TM-Source) and the TM of the destination storage cluster (TM-Dest) may orchestrate the partition handoff process. While the examples described hereinafter involve communication between the TM-Source and the TM-Dest, in other examples an external agent may drive the partition handoff process.

Prior to the partition handoff, the distributed computing system 102 pairs a source storage cluster and a destination storage cluster for an account being migrated. Pairing the two storage clusters may comprise, for example, programming firewall rules to allow traffic between the two storage clusters, and enabling name resolvers on each storage cluster to know about the role information of each cluster. Further, pairing the two storage clusters may involve enabling shared front end (FE) resources, such that every FE in each storage cluster participating in the pairing downloads and maintains a partition map of each storage cluster participating in the pairing. By enabling shared FE, an FE is able to re-direct user requests to the correct partition serving a key range for any account belonging to the paired storage clusters. It will be understood that virtualizing the storage account being migrated prior to partition handoff, such that the DNS of the storage account points to the VIPs of both storage clusters, may optionally be performed.

FIG. 2 illustrates an example source storage cluster (cl1) 202 and destination storage cluster (cl2) 204 prior to a partition handoff. In this example, a table server (TS-48) 206 serves a first partition (P1) 208 belonging to an account being migrated from the source storage cluster 202 to the destination storage cluster 204. A cluster service manager of the source storage cluster (CSM-Source) 210 owns and manages the streams (one of which is shown at 212) and underlying extents (indicated as extents E1 to En) of the first partition 208. More specifically, the CSM-Source 210 maintains the logical-to-physical mapping for extent storage.

The FE 214, 216 roles are shared, such that the FE roles on both storage clusters 202, 204 maintain the partition tables 218, 220 for both clusters in-memory and refresh the partitions tables (e.g. periodically, or when an entry is invalidated). When the storage account being migrated is virtualized, user requests for the storage account may be directed to FE roles on both clusters 202, 204 and re-directed to the first partition 208 on TS-48 206 of the source storage cluster 202. When the storage account is not virtualized, user requests may be directed to FEs 214 of the source storage cluster 202.

In the preparation phase, the TM-Source 222 interacts with the TS of the source storage cluster (TS-Source) 206 and the TM-Dest 224, which each interact with their respective CSM (CSM-Source 210 and CSM-Dest 226) to prepare for the partition handoff. The TM-Source 222 informs the TM-Dest 224 to prepare the streams managed by the CSM-Dest 226, which will receive extents from the CSM-Source 210. As mentioned above, to keep the handoff phase lightweight, the preparation phase involves transferring the extent metadata from the source storage cluster 202 to the destination storage cluster 204.

FIG. 3 schematically depicts steps involved in the preparation phase 300 of a partition handoff. As indicated by arrow (1), the TM-Source 222 quarantines the key range of the partition, blocks splits and merges on the key range, and persists an intention to begin the handoff of the partition in the source partitions table 228. Though the TM-Source 222 blocks splits and merges on the key range, high-priority offloads are permitted. TM failover may cause the new TM to read this information and re-execute the preparation workflow. The information persisted in the source partitions table 228 may include a flag indicating the beginning of preparation for partition handoff on the source storage cluster. An existing partitions table for a current data migration system may undergo a schema upgrade or otherwise retrofit columns to accommodate such flags.

Next, as indicated by arrow (2), the TM-Source 222 performs an initial handshake with the TM-Dest 224, which informs the TM-Dest 224 of the intention to hand off. In some instances, the initial handshake is performed via an asynchronous application programming interface (API) request (e.g. a request to prepare for partition handoff on the TM-Dest). The initial handshake includes sending values defining a low end and a high end of a key range for the partition and a partition name. In various examples, the initial handshake may include sending a short name for the partition rather than the partition name, which may help ease debugging.
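
For illustration, the fields carried by such a prepare-for-handoff request might be modeled as follows (Python; the field names are hypothetical and the request is a sketch only):

    from dataclasses import dataclass

    @dataclass
    class PrepareHandoffRequest:
        """Illustrative payload of the asynchronous prepare-for-handoff request
        sent by the TM-Source to the TM-Dest during the initial handshake."""
        key_low: str         # low end of the key range being handed off
        key_high: str        # high end of the key range being handed off
        partition_name: str  # or a short name for the partition, to ease debugging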

Next, as indicated by arrow (3), the TM-Dest 224 quarantines the key range. The TM-Dest 224 may block the new key range by splitting an existing partition on the destination storage cluster 204, such that the new key range becomes part of an existing partition's key range. The TM-Dest 224 also creates a partition entry in the partitions table 230 on the destination storage cluster 204. In the partition entry, the table server instance may not be populated, and information regarding a state of the partition being received may be stored as a flag in a flags column of the partitions table. In one specific example, a flag indicating that a partition handoff is in progress on the destination storage cluster may represent the state of the partition and also instruct the TM-Dest 224 not to load the partition on any TS-Dest (232 in FIG. 2) upon failover, as this partition is not yet in a valid state to do so. In some examples, the TM-Dest 224 may batch updates to the destination partitions table 230 for the quarantined key ranges (the left-hand side and right-hand side partitions of the partition being handed off, and the partition being handed off itself). If the TM-Dest 224 fails over, the TM-Source 222 may continue to retry. This step may be configured to be idempotent so that the TM-Dest 224 is able to verify/complete the unfinished work in the retry attempt. The TM-Dest 224 need not save a command in the transaction log, as the TM-Source 222 polls/retries the asynchronous API to prepare for partition handoff.

As indicated by arrow (4), the TM-Source 222 polls for the completion of the TM-Dest 224 asynchronous API call. When successful, the TM-Source 222 requests the TS-Source 206 to prepare for partition handoff on the TS-Source 206, for example, via an asynchronous API call.

When the TS-Source 206 receives the request, the TS-Source 206 performs various actions, as indicated by arrow (5). As the requested operation is an idempotent operation, the API is expected to complete the preparation step reliably even in instances where the TM-Source 222 fails over and/or the TS-Source 206 crashes or restarts. The TS-Source 206 checkpoints its memory-table and persists handoff state information in its metadata stream record, including partition flags with handoff state information (e.g. indicating that preparation for partition handoff on the source storage cluster is in progress) and the source cluster name. If the partition reloads for any reason during this step (e.g. the TS-Source crashes or restarts, an emergency offload, a forceful partition reassignment from a storage diagnostics service for a live site, etc.), the TS-Source 206 may re-execute the steps involved in the preparation for TS-Source partition handoff during reload using the metadata stream record. For example, the TS-Source 206 may submit a job to a lazy worker (e.g. a thread pool). The TS-Source 206 also blocks multi-modifies, deletes, and stream operations on the partition streams, which may help simplify extent handoff between the CSM-Source 210 and CSM-Dest 226. At this stage, newer writes may still be permitted to the partition, extents may be sealed, and newer extents may be created. Further, linking extents at the end of the stream—although a multi-modify operation—may be allowed for copy blob cases.
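
A minimal sketch of this idempotent persist-then-resume pattern is shown below (Python; the record format and flag name are hypothetical and not part of this disclosure):

    from dataclasses import dataclass, field
    from typing import Dict, List

    HANDOFF_PREPARING = "PartitionHandoffOnSourcePreparing"  # hypothetical flag value

    @dataclass
    class MetadataStream:
        records: List[Dict] = field(default_factory=list)

        def append_record(self, record: Dict) -> None:
            self.records.append(record)

        def latest(self) -> Dict:
            return self.records[-1] if self.records else {}

    def prepare_for_handoff(metadata_stream: MetadataStream, source_cluster: str) -> None:
        """Idempotent preparation: persist the handoff state so that a partition
        reload (crash, restart, emergency offload) can detect it and resume."""
        record = {"flags": HANDOFF_PREPARING, "source_cluster": source_cluster}
        if metadata_stream.latest() != record:  # re-executing the step is harmless
            metadata_stream.append_record(record)

    def should_resume_preparation(metadata_stream: MetadataStream) -> bool:
        """On partition reload, return True if the preparation steps should be
        re-executed, e.g. by submitting a job to a lazy worker (thread pool)."""
        return metadata_stream.latest().get("flags") == HANDOFF_PREPARING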

The TS-Source 206 also informs the CSM-Source 210 to create mirror streams 234 on the destination storage cluster 204 and prepare for extent handoffs for each stream 212, as indicated by arrow (6). The TS-Source may inform the CSM-Source 210 via an API configured to associate the stream 212 (whose extents are being handed off) of the source storage cluster 202 with the stream on the destination storage cluster 204. In response, the CSM-Source 210 transfers a bulk of the extent metadata to the CSM-Dest 226 in order to make the handoff phase lightweight. The steps taken by the TS-Source to block major stream modification operations may help to prevent any significant changes to the extent metadata after the preparation phase.

In one example, the CSM-Source 210 and the CSM-Dest 226 perform the following actions at arrow (6.1) to complete the preparation phase. The CSM-Source 210 creates partition streams on the destination storage cluster 204, which are empty (e.g. do not yet have any extents) in preparation to receive extents from the source storage cluster 202. The CSM-Source 210 works with the CSM-Dest 226 to copy “sealed” extents metadata, with sealed extents being immutable. The CSM-Source 210 performs validations for feasibility of handing over the extents to the destination storage cluster 204, such as a confirmation check for connectivity with the CSM-Dest 226, a codec compatibility check, and/or a determination of any data unavailability on the destination storage cluster 204. At this stage, the extents are still managed by the CSM-Source 210, and if extent nodes fail and the extents have to be replicated/repaired, then the CSM-Source 210 is responsible for the replication/repair. Where the sealed extents metadata on the CSM-Dest 226 may be slightly stale, this may be resolved during the handoff phase, which is described in detail below. The CSM-Dest 226 may also verify the extents metadata post-transfer as part of this call by syncing with the extent nodes (EN nodes) serving those extents. If an extent is held by one or more streams belonging to the same partition, the extent ownership may be transferred to the CSM-Dest 226. If an extent is linked to multiple streams owned by different partitions, the extent ownership may not be immediately transferred to the CSM-Dest 226, and instead transferred once all the different partitions are handed off to the CSM-Dest 226.
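
The batched transfer of sealed-extent metadata during the preparation phase may be sketched as follows (Python, illustrative only; the feasibility validations such as connectivity, codec compatibility, and data-availability checks are omitted, and send_to_csm_dest is a hypothetical callable standing in for the CSM-to-CSM transfer):

    from typing import Callable, Dict, Iterable, List

    def batches(items: List[Dict], batch_size: int) -> Iterable[List[Dict]]:
        """Yield successive batches of extent metadata records."""
        for i in range(0, len(items), batch_size):
            yield items[i:i + batch_size]

    def prepare_extent_metadata_transfer(source_extents: List[Dict],
                                         send_to_csm_dest: Callable[[List[Dict]], None],
                                         batch_size: int = 1000) -> List[Dict]:
        """Batch-copy metadata of sealed (immutable) extents to the destination
        CSM during the preparation phase; any still-unsealed extents are
        returned, to be sealed and handed off later during the handoff phase."""
        sealed = [e for e in source_extents if e.get("sealed")]
        for group in batches(sealed, batch_size):
            send_to_csm_dest(group)
        return [e for e in source_extents if not e.get("sealed")]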

The extent metadata transfer performed in the preparation phase 300 is decoupled from ingress. Regardless of how high ingress is, the extent metadata size may remain relatively unchanged. This allows the distributed storage system to perform little to no “catch-up” to transfer recently received data on the source storage cluster to the destination storage cluster. In one specific example, each extent of a stream is three gigabytes (GB) in size with 30 megabytes per second (MBps) of ingress into the stream. In this example, the ingress may create one new extent every one hundred seconds, and the number of new extents created during the preparation phase 300 may be less than ten. In another specific example, for a stream comprising fifty thousand extents (with erasure code fragments), the metadata transfer is completed in two seconds or less. In this example, each extent metadata record is less than five hundred bytes and is batch transferred, so the catch-up needed with extent metadata transfer may be nearly negligible, even for high ingress accounts.
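
The back-of-envelope arithmetic behind these two specific examples may be checked as follows (illustrative only):

    # Example 1: extent creation rate under high ingress.
    extent_size_mb = 3 * 1024          # a 3 GB extent, expressed in MB
    ingress_mbps = 30                  # 30 MBps of ingress into the stream
    seconds_per_new_extent = extent_size_mb / ingress_mbps
    print(seconds_per_new_extent)      # ~102 seconds, i.e. roughly one new
                                       # extent every one hundred seconds

    # Example 2: total metadata size for a fifty-thousand-extent stream.
    extent_count = 50_000
    metadata_bytes_per_extent = 500    # upper bound per the example above
    total_metadata_mb = extent_count * metadata_bytes_per_extent / 1_000_000
    print(total_metadata_mb)           # ~25 MB of metadata, small enough to be
                                       # batch transferred in about two seconds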

As mentioned above, one or more partition flags and/or a partition metadata stream record for the partition handoff may be updated during the preparation phase 300. For example, in the steps indicated by arrows (1) and (2) in FIG. 3, the TM-Source 222 may update a partition flag in the source partitions table 228 to indicate a state of beginning to prepare for partition handoff on the source storage cluster. As another example, in the steps indicated by arrows (3) and (4) in FIG. 3, the TM-Dest 224 may update a partition flag in the destination partitions table 230 to indicate that partition handoff is in progress on the destination storage cluster 204. At this point, the source partitions table 228 may still include the partition flag indicating the state of beginning to prepare for partition handoff on the source storage cluster. As yet another example, in the steps indicated by arrows (5), (6), and (6.1) in FIG. 3, the TM-Source 222 may update the partition metadata stream record for the partition handoff to indicate that preparing for partition handoff on the source storage cluster is in progress.

Once the preparation phase 300 is complete, the first partition 208 belonging to the account being migrated from the source storage cluster 202 to the destination storage cluster 204 is still served by TS-48 206 of the source storage cluster 202. A table server TS-156 (shown as 232 in FIG. 3) on the destination storage cluster 204 has created the partition streams for the first partition 208, to hold the extents being handed off in the handoff phase. The first partition's streams and the underlying extents are still owned and managed by the CSM-Source 210. However, the CSM-Dest 226 has established a secondary ownership of at least some of the sealed extents. The FE roles behavior may remain unchanged after the preparation phase 300, such that user requests for the storage account may be directed to FE roles on both clusters (e.g. via VIPs 114, 116) and be correctly re-directed to the first partition 208 on TS-48 206 of the source storage cluster 202.

The TM-Source may begin the handoff phase once the preparation phase 300 is complete. FIG. 4 illustrates aspects of an example handoff phase 400 in which the extents are handed off to the CSM-Dest 226. In some instances, user traffic to the storage account being migrated is regulated during the handoff phase 400.

As indicated by arrow (1), the TM-Source 222 updates the partition flag in the source partitions table 228 for the partition being handed off by removing a flag indicating a beginning of preparation for partition handoff on the source and setting a flag indicating a beginning of the partition handoff on the source. Updating the flags may help with TM failover cases so that the TM-Source 222, when reconstructing its in-memory state from the source partitions table 228, knows at which step to resume the handoff operation.

As indicated by arrow (2), the TM-Source 222 issues a command to the TS-Source 206 to perform partition handoff from the TS-Source, which may be an asynchronous API request in some examples. In response, the TS-Source 206 performs a sequence of processes, as indicated by arrow (3).

At arrow (3), the TS-Source 206 persists the partition handoff state information, such as the partition flags with handoff state information and the source cluster name, in its metadata stream record. In instances where the partition reloads for any reason (e.g. the TS-Source crashes or restarts, emergency offload, forceful partition reassignment, etc.), the TS-Source 206 may re-execute the steps involved in the partition handoff from the TS-Source API during reload using the metadata stream record. For example, the TS-Source may re-execute by submitting a job to a lazy worker. When the partition is reloaded on the source storage cluster 202, e.g. in instances where the TS-Source 206 crashes or restarts during handoff, the new TS-Source loading the partition may detect the case and resume the partition handoff. When an extent handoff is successful and the partition loads on the destination storage cluster, the record is read by the destination storage cluster 204, which then knows that the partition is in a handoff state from the source storage cluster 202.

The TS-Source 206 also blocks new requests, recording this in the same write to the metadata stream as the partition handoff state information or in a separate write. This causes the FE to back off and retry. As the actual handoff phase may potentially fail, the FEs direct requests to the same table server until the handoff phase is successful. This also may help to simplify rollback. Further, the TS-Source 206 sends a command via an API to the CSM-Source 210 to complete the handing off of extents belonging to one or more streams of the first partition 208 on the source storage cluster 202 to the destination storage cluster 204.

The CSM-Source 210 interacts with the CSM-Dest 226 to complete the extent handoff process, as indicated by arrow (4). The CSM-Source 210 seals unsealed extents on the source storage cluster 202 and completes the handoff of extents belonging to each stream of the first partition 208. Completing the handoff of the extents of each stream may involve, for example, transferring metadata for the stale extents and the new extents to the CSM-Dest 226, and the CSM-Dest 226 assuming ownership of the extents. The CSM-Source 210 further may change an access attribute of the extents to read-only, thereby invalidating the extents at the source storage cluster. The CSM-Source 210 also may perform a scrub and/or a validation of the extents handed off, to ensure that the extents are intact on the destination storage cluster 204. More specifically, the CSM-Source 210 ensures that none of the extents of the source storage cluster stream 212 are missing in the corresponding stream 234 on the destination storage cluster 204. If either of these steps fails or a deadline passes, the TS-Source 206 may return a failure error code to the TM-Source 222 and resume serving user requests. In this example, the TM-Source 222 is responsible for re-trying or proceeding with aborting the handoff phase. Further, the CSM-Source 210 performs a release partition command to invalidate the partition entry in the in-memory partition map.
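
The ordering of operations at arrow (4), namely sealing, catching up metadata, verifying, and then marking the source extents read-only, may be sketched as follows (Python, illustrative only; the structures and names are hypothetical):

    from dataclasses import dataclass
    from typing import Dict, List

    @dataclass
    class ExtentMeta:
        extent_id: str
        sealed: bool = False
        read_only: bool = False

    def complete_extent_handoff(source_stream: List[ExtentMeta],
                                dest_stream: Dict[str, ExtentMeta]) -> None:
        """Seal any unsealed extents, catch up stale/new extent metadata on the
        destination, verify nothing is missing, then mark the source copies
        read-only (invalidating them at the source)."""
        for extent in source_stream:
            if not extent.sealed:
                extent.sealed = True                 # sealed extents are immutable
            if extent.extent_id not in dest_stream:  # catch-up for new/stale extents
                dest_stream[extent.extent_id] = ExtentMeta(extent.extent_id, sealed=True)

        # Scrub/validate: none of the source stream's extents may be missing on
        # the destination; on failure, the source resumes serving user requests
        # and the TM-Source retries or aborts the handoff phase.
        missing = [e.extent_id for e in source_stream if e.extent_id not in dest_stream]
        if missing:
            raise RuntimeError("extents missing on destination: " + ", ".join(missing))

        for extent in source_stream:
            extent.read_only = True                  # invalidate at the source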

As indicated by arrow (5), the TM-Source 222 updates its own partitions table 228 by adding a redirection for the key range to point to the destination storage cluster 204 and updating the partition flag to indicate that partition handoff from the source is completed. The updated partition flag indicates that work by the source storage cluster 202 is complete and the handoff phase is pending on the destination storage cluster 204. This way, the TM-Source 222 knows where to resume the handoff phase if the TM-Source 222 fails over. The redirection entry in the source partitions table 228 may comprise a string containing a cluster name in place of the table server serving the key range. FEs maintain partition maps for the source and destination storage clusters, so the redirection entry from the source partitions table 228 to the destination partitions table 230 may be an in-memory lookup during TS name resolution. Whenever a partition map entry is invalidated, the FE fetches all partitions information, resolves all TS endpoints (table servers serving the respective key ranges), and maintains the resolved endpoint object associated with each partition entry, so that the FE may readily serve requests. The redirection entry does not add overhead to user requests, as the redirection entry is used for name resolution performed outside of the user request path.
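
A sketch of how an FE might resolve a table server through such a redirection entry follows (Python, illustrative only; the "cluster:" prefix and the flat key-to-server maps are hypothetical simplifications of the partitions tables):

    from typing import Dict, Optional

    def resolve_table_server(key: str,
                             source_partitions: Dict[str, str],
                             dest_partitions: Dict[str, str]) -> Optional[str]:
        """Resolve the table server serving a key. A redirection entry stores a
        cluster name (marked here with a hypothetical 'cluster:' prefix) in
        place of a table server, so resolution continues with an in-memory
        lookup in the destination cluster's partitions table."""
        def lookup(table: Dict[str, str]) -> Optional[str]:
            # Each table maps the low key of a range to the server (or the
            # redirection string) for that range; a real partitions table
            # would use sorted key ranges.
            lows = [low for low in table if low <= key]
            return table[max(lows)] if lows else None

        entry = lookup(source_partitions)
        if entry is not None and entry.startswith("cluster:"):
            return lookup(dest_partitions)  # follow the redirection entry
        return entry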

As indicated by arrow (6), the TM-Source 222 informs the TM-Dest 224 to assume ownership of the key range, for example, via an asynchronous API call. At (6a), the TM-Dest 224 selects a TS-Dest 232 to load the partition and updates the TS-Dest details in the destination partitions table 230 for the handed-off partition, which may be similar to an ordinary (e.g. not during a handoff) partitions table update in some examples. The TM-Dest 224 clears the flag indicating that the partition handoff to the destination storage cluster is in progress, after which the partition key range is owned by the TM-Dest 224.

At (6b), the TM-Dest 224 instructs the TS-Dest 232 to load the partition. The TS-Dest 232 loads the partition, and upon successful load of the partition on the TS-Dest 232, the TS-Dest 232 may delete the metadata record containing the partition handoff state (e.g. the record indicating that the partition handoff is in progress) entered on the source storage cluster 202. Once the partition is loaded on the destination storage cluster 204, live traffic is reopened and the partition state record for live traffic is updated. If the TM-Dest 224 fails over in step (6a), the TM-Source 222 may continue retrying until the TM-Source 222 receives an acknowledgement. If the TM-Dest 224 crashes or restarts in step (6b), a subsequent load attempt may reliably detect that this is the first load after handoff using the metadata stream record (e.g., using a record indicating that the partition handoff is in progress).

As indicated by arrow (7a), the TM-Source 222 polls/retries for completion of the ownership transfer to the TM-Dest 224. More specifically, the TM-Dest 224 updates the destination partitions table 230 after clearing the partition flag indicating that the partition handoff to the destination storage cluster is in progress. Once the TM-Source 222 receives an acknowledgment from the TM-Dest 224, the TM-Source 222 updates the partition flag in its partitions table 228 to indicate that the partition handoff is complete, which signifies the transfer of ownership of the partition key range to the destination storage cluster 204, as indicated by arrow (7b). The redirection entry may help re-direct FE requests to this key range to the destination storage cluster 204.

After redirection, the CSM-Dest 226 may migrate extents from the source storage cluster 202 to the destination storage cluster 204 (or any other storage cluster in a destination limitless pool) without any urgency. In some examples, an affinity policy may be set to not move data on all streams by default, and to move data at a capacity threshold on certain streams. Such affinity policies may be retrofitted, or new policies introduced, such that extents of certain streams are given higher preference over others for data transfer.

Once all extents of a partition are migrated to the destination storage cluster 204, the TM-Source 222 may delete the partition streams from the source storage cluster 202 using a CSM-Source API on an account migration engine. Clearing streams on the source storage cluster 202 helps to free up capacity on the source storage cluster 202. In some examples, the TM-Source 222 may wait until the full storage account, including all partitions of the storage account, is transferred to the destination storage cluster 204 to delete the partition streams. A migration tool polls the status of migration on each of the streams on the source storage cluster, and once all extents belonging to a stream are migrated and verified, the migration tool may proceed with cleanup. The streams and partitions entries for the transferred partitions may be cleaned up around the same time. Otherwise, partition maintenance may clean up the streams if there are no partition entries, based on an assumption that the streams are orphaned streams.

As mentioned above, one or more partition flags and/or the partition metadata stream record for the partition handoff may be updated during the handoff phase 400. At the end of an example preparation phase 300, the source partitions table 228 may include a flag indicating a state of beginning to prepare for the partition handoff on the source storage cluster 202, the destination partitions table 230 may include a flag indicating the partition handoff is in progress on the destination storage cluster 204, and the metadata stream record for the partition handoff may include a record indicating preparation for the partition handoff on the source storage cluster is in progress. In one specific example, the following partition flag and partition metadata stream record updates occur during the handoff phase 400. At processes indicated by arrows (1) and (2) in FIG. 4, the TM-Source 222 may update the partition flag in the source partitions table 228 to indicate a state of beginning the partition handoff on the source storage cluster 202. At processes indicated by arrows (3) and (4) in FIG. 4, the TS-Source 206 may update the partition metadata stream record after blocking live traffic to the source storage cluster 202, e.g. to a record indicating that the partition handoff is in progress. At the process indicated by arrow (5) in FIG. 4, after extent handoff, the TM-Source 222 may update the flag in the source partitions table 228 to indicate completion of partition handoff on the source storage cluster 202. At the process indicated by arrow (6a) in FIG. 4, the TM-Dest 224 may update the destination partitions table 230 after assuming ownership of the key range, e.g. to clear the flags related to partition handoff in the destination partitions table 230. At the process indicated by arrow (6b) in FIG. 4, the TS-Dest 232, upon successful partition load, may clear the partition metadata stream record for the partition handoff. At the process indicated by arrow (7b) in FIG. 4, the TM-Source 222, after receiving an acknowledgement from the TM-Dest 224 regarding key ownership transfer, may update the partition flag in the source partitions table 228 to indicate completion of the partition handoff. After all partitions to be migrated are transferred to the destination storage cluster 204 and long-term (LT) cleanup is initiated, the TM-Source 222 may clear all corresponding partition entries in the source partitions table 228.

Following the handoff phase 400, the TS-156 of the destination storage cluster 204 serves the first partition belonging to the storage account being migrated from the source storage cluster 202 to the destination storage cluster 204. The streams and underlying extents of the first partition are owned and managed by the CSM-Dest 226. The source partitions table 228 includes a redirection entry for the first partition, which now points to the destination partitions table 230. Further, user requests for the storage account that are directed to FE roles on both the source and destination storage clusters 202, 204 are correctly redirected to the handed-off partition on TS-156 232 of the destination storage cluster 204.

The stream layer may perform, as a background process, data verification for the extents being migrated. The stream layer also may perform failure handling and retry/alerting during the migration of extent data. In some examples, verification of migrated partition objects and the underlying data may be performed by iterating through each object in a partition and reading the objects, data, and optionally the underlying data on the source and destination storage clusters, and comparing using a geo-pipeline-based verification. In contrast, verification for the disclosed handoff process (preparation phase 300 and handoff phase 400) may include extent metadata verification, extent reachability, table layer index validity, extent data integrity/verification, and geo verification.

Extent metadata verification may occur during the handoff phase, for example, by determining that the extents and the order of the extents in the source and the destination streams are the same at the time of handoff. The CSM-Source 210 and the CSM-Dest 226 may perform the extent metadata verification. For example, verification may be performed inline by CSM APIs during metadata migration (preparation and finalization steps). If an extent is missing or not in the correct order, a finalize call fails and the partition begins serving user traffic on the source storage cluster 202.

After a partition is handed off to the destination storage cluster 204, the distributed computing system determines whether all extents are reachable and readable from the destination storage cluster 204. When the partition is handed off from the source storage cluster 202 and loads on the destination storage cluster 204, the destination storage cluster 204 may initiate a work item to determine extent reachability. For every stream in the partition, a scrubber may fetch all the extents in each stream and try to read at least one byte per extent. The scrubber also may perform extent length checks to ensure that the extent is reachable and known to the CSM-Dest 226. An alert is raised if an extent is unreachable. In one specific example, determining extent reachability for a data stream comprising approximately 200,000 extents may complete within an hour after handing off the partition to the destination storage cluster 204.
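
The reachability scrubber described above may be sketched as follows (Python, illustrative only; the callables stand in for hypothetical stream-layer and CSM-Dest interfaces):

    from typing import Callable, Dict, List

    def scrub_extent_reachability(partition_streams: Dict[str, List[str]],
                                  read_one_byte: Callable[[str], bytes],
                                  dest_extent_length: Callable[[str], int],
                                  expected_length: Callable[[str], int],
                                  raise_alert: Callable[[str], None]) -> None:
        """For every stream in the partition, try to read at least one byte of
        each extent and check its length as known to the CSM-Dest; alert on any
        extent that is unreachable or inconsistent."""
        for stream_name, extent_ids in partition_streams.items():
            for extent_id in extent_ids:
                try:
                    read_one_byte(extent_id)  # the extent must be readable
                    if dest_extent_length(extent_id) != expected_length(extent_id):
                        raise_alert(f"{stream_name}/{extent_id}: length mismatch")
                except Exception:
                    raise_alert(f"{stream_name}/{extent_id}: unreachable")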

Table layer index validity scrubbing may help to determine whether file tables include dangling pointers to the extents (the extents portion of the partition index that are not part of the destination streams). For example, a scrubber within the partition may be used to ensure that extents are reachable on the destination storage cluster 204 stream and that the ordering is the same as the ordering of the source storage cluster 202 stream. Further, the TM-Source may ask a garbage collection (GC) master to schedule a high-priority GC run with zero rewrite for the partitions handed off to the destination storage cluster(s). In this manner, the GC run may find any dangling pointers in the index or issues with extents reachability. Since these are high-priority GC runs with no data rewrites, they may be expected to complete before normal long-term (LT) cleanup (e.g. within seven days after data migration completion). During LT cleanup, blank partitions on the source storage cluster 202 are allowed to accept new traffic and migrated streams on the source storage cluster 202 are deleted from the source storage cluster 202 via the CSM-Source API. Before LT cleanup, a driver may check whether the last GC run completed, and fail LT cleanup if any GC runs are pending. A force LT cleanup may also be used to skip this check, in some examples.

Extent data integrity/verification may be performed after a partition is handed off to the destination storage cluster 204, user traffic is routed to the destination storage cluster 204 for the corresponding key range, and the extent data is copied to the destination storage cluster 204. Once this verification succeeds, the extents on the source storage cluster 202 may be deleted. Because all extents are sealed before and during the partition handoff, the extent data is immutable and can be compared on the source and destination storage clusters. The data verification for the extents being migrated thus may be performed as a background process separate from the user request path.

Since partitions may be handed off within the same data center, across zonal data centers within the same region, and/or across geographically separated regions, the geo verification scans the partition index and validates the integrity of data stored using cyclic redundancy checks (CRCs).

As mentioned above, a partition handoff may fail, or an account migration may be unintentionally aborted during the preparation phase and/or the handoff phase. Further, one or more of the TM-Source 222, TM-Dest 224, TS-Source 206, and TS-Dest 232 may restart or fail. In such instances, a TM master log may help to restart/resume operations.

When failure occurs during the preparation phase 300, during or after the TM-Dest 224 completes creation of streams on the destination storage cluster 204, the TM-Source may issue an abort command to the TM-Dest 224. The abort command may delete the streams and remove the entry in the destination partitions table 230. If the TM-Dest 224 command fails with a non-retriable error, the TM-Source 222 may issue the same abort command to trigger cleanup. In another example, failure may occur during or after the preparation for partition handoff from the TS-Source, where the TS-Source 206 interacts with the CSM-Source 210 to prepare the extent metadata handoff. The TM-Source 222 may issue an abort or cleanup command (e.g. to reset the partition handoff) to cancel the CSM-Source operation, or to clean up if the extent metadata preparation is already complete. In such instances, the TM-Source 222 may also perform an API call to abort or clean up the preparation for partition handoff on the TM-Dest 224.

Failure may also occur during the handoff phase. In examples where failure occurs during or after the partition handoff from the TS-Source call, the TM-Source 222 may issue a command to abort the handoff from the TS-Source, clean up on the destination storage cluster 204, and resume accepting user requests. The TM-Source 222 also may issue a command to abort the preparation for partition handoff from the TS-Source, to clean up the streams on the destination storage cluster 204. In other examples, failure or abort handling may include any suitable combination of the disclosed rollback scenarios. Further, rollback after a partition handoff is complete may follow a partition handoff process for handing off the partition from the destination storage cluster to the source storage cluster.

As mentioned above, a partition transfer from a source storage cluster to a destination storage cluster may be performed as part of an account migration process. An account scheduler determines whether to migrate a storage account and to which destination clusters based on any number of criteria. Once a storage account and its destination storage cluster(s) are selected, and any preparatory steps for the partition handoff process are satisfied (e.g. pairing source and destination clusters), a location service (e.g. a global and/or a regional location service) begins the migration of the storage account, for example, by interacting with a driver (e.g. an account migration engine) of the source and destination storage clusters.

To prepare for account migration, the migration scheduler evaluates the health of the streams and extents of all partitions belonging to the storage account. For example, the migration scheduler may determine that an encryption key used to encrypt the partition index is the same on the source and destination storage clusters. The migration scheduler also may determine a total number of extents per stream, a maximum extent reference count, etc., such that the extent handoff may complete within a targeted amount of time. In some examples, each partition handoff may complete within two minutes. In a more specific example, each partition handoff may complete within a few seconds.

Executing the account migration involves, for example, quarantining the account key range in all tables and replicating an account row. To quarantine the account key range, a driver (e.g. an account migration engine) of the destination storage cluster 204 creates the account row and copies all XLS owned properties from the source storage cluster 202. Any future updates to the account row, such as those resulting from user control plane operations, will be explicitly added to the appropriate partitions and streams by XLS.

A driver (e.g. an account migration engine) of the source storage cluster 202 sends an account migration command, including an account name and the destination storage cluster, to the TM-Source 222. The TM-Source initiates the partition handoff process for one or more partitions concurrently, where each partition is quarantined and a load balancer (LB) is blocked while the partition handoff is in progress. In some examples, the partition handoff may complete within minutes. A TM of a secondary source storage cluster may also receive the same command as the TM-Source 222 and perform the same migration processes as the primary source storage cluster.

Once all partitions are handed off to the destination storage cluster 204, service metadata for the storage account (in the account row) on the source storage cluster 202 is updated with redirection information for the storage account, which points to the destination storage cluster 204. The redirection entry in the accounts table may comprise the destination storage cluster name, as both the source and destination storage clusters load the accounts table of both clusters. The in-memory redirection is conceptually similar to the partition table redirection entry described above.

As long as both the source and destination storage clusters are paired, user requests landing on the source storage cluster 202 will be correctly resolved to the destination partitions table 230. Once the storage account is un-virtualized and cluster pairing is removed (e.g., as part of a post-migration finalization, described in more detail below), user requests landing on the source storage cluster 202 (e.g. due to stale DNS) are dispatched to the destination storage cluster 204 using the redirection entry.

Once the account table of the source storage cluster 202 includes the redirection entry to the destination storage cluster 204, the partitions table redirection entries on the source storage cluster 202 are no longer needed. Thus, after an account is handed off to the destination storage cluster 204, the distributed computing system may purge the redirection entries in the source partitions table 228 for all handed-off partitions. User requests landing on the FE of the source storage cluster 202 may cause the FE to look up the account tables of each storage cluster to determine the home storage cluster for the storage account, and look up the partitions table of the home storage cluster to determine which TS serves the key range.

The migration scheduler may monitor the progress of extents being migrated by the stream layer to the destination cluster (e.g. using an API) before proceeding to post-migration finalization. Once all the partitions of the storage account are handed off to the destination storage cluster and migration of the extents to the destination storage cluster is verified, the migration scheduler may un-virtualize the storage account being migrated such that DNS records for the storage account point to the destination storage cluster 204. In this manner, the destination storage cluster 204 may receive customer requests, including when a customer request involves data not yet migrated from the source storage cluster 202 to the destination storage cluster 204. The migration scheduler also purges the account row in the account table of the source storage cluster 202 and cleans up the partition streams on the source storage cluster 202. Optionally, the migration scheduler unpairs the source and destination storage clusters through a cluster resource manager (CRM) once all storage accounts (including underlying data) have been migrated from the source storage cluster 202 to the destination storage cluster 204, e.g. to decommission the source storage cluster 202.

As mentioned above, account migration involves pairing both source and destination storage clusters through a CRM. If the storage clusters support migration across geographical regions, then the respective secondary storage clusters may be paired together by XLS sending an account migration command to the TM on the source primary storage cluster and the source secondary storage cluster. No coordination may be required between primary and secondary storage clusters, for both the source and destination storage clusters.

The examples disclosed herein are not limited to alleviating capacity, TPS, and CPU resources at an individual storage cluster level. In some examples, the disclosed methods may be used to balance CPU, TPS, and/or capacity resources across a group of storage clusters. FIG. 5 depicts an example pairing 500 of two storage cluster groups for account migration across the storage cluster groups. In this example, a first storage cluster group 502 has reached a threshold resource limit (e.g. CPU usage, TPS, and/or capacity), and pairing the first storage cluster group 502 with a second storage cluster group 504 may not be an option due to scale and/or performance reasons. To alleviate a burden on the first storage cluster group 502, accounts from storage cluster 1-1 of the first storage cluster group 502 may be migrated to storage cluster 2-3 of the second storage cluster group 504.

The disclosed examples also support storage account migrations to multiple destination clusters, as shown in FIG. 6. In this example, a source storage cluster 602 performs account migration to destination storage clusters 604 and 606 in parallel. This may help to speed up decommissioning of older hardware storage clusters by migrating storage accounts to multiple destination storage clusters in parallel.

FIG. 7 illustrates an example method 700 for transferring a data partition belonging to an account being migrated from a first storage cluster (source) to a second storage cluster (destination). Method 700 may be implemented as stored instructions executable by a computing system, such as distributed computing system 102.

At 702, method 700 comprises determining that a data partition meets a migration criteria for migrating from the first storage cluster to the second storage cluster. Determining that the data partition meets the migration criteria may comprise determining that the first storage cluster is operating at or near a threshold based upon one or more of TPS, CPU usage, and storage capacity, as indicated at 704. Determining that the data partition meets the migration criteria further may comprise determining based upon a decommissioning of the first storage cluster, as indicated at 706. Determining that the data partition meets the migration criteria also may comprise determining that a data storage account comprising the data partition meets the migration criteria, as indicated at 708.
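
For illustration only, the checks at 704-708 could be combined as in the sketch below; the threshold and attribute names are assumptions rather than actual system parameters.

    # Hedged sketch of the migration-criteria determination (steps 704-708).
    def meets_migration_criteria(cluster, account):
        over_threshold = (cluster.tps >= cluster.tps_limit                       # 704: TPS pressure
                          or cluster.cpu_usage >= cluster.cpu_limit              # 704: CPU pressure
                          or cluster.used_capacity >= cluster.capacity_limit)    # 704: capacity pressure
        decommissioning = cluster.being_decommissioned                           # 706
        account_flagged = account.meets_migration_criteria                       # 708: account-level decision
        return over_threshold or decommissioning or account_flagged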

At 710, method 700 comprises, on the first storage cluster, preparing partition metadata to be transferred. The partition metadata describes one or more streams within the data partition and one or more extents within each stream. Preparing the partition metadata may comprise quarantining a key range of the data partition, blocking splits and merges on the key range, and persisting an intention to begin handoff of the data partition in a partition table, as indicated at 712. Preparing the partition metadata may also comprise, on the first storage cluster, creating mirror streams on the second storage cluster, as indicated at 714.
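
One possible shape of this preparation step (712-714) is sketched below; the helper methods and metadata layout are hypothetical.

    # Hedged sketch of preparing partition metadata on the source (first) cluster.
    def prepare_partition_metadata(partition, source, destination):
        source.quarantine_key_range(partition.key_range)           # 712: quarantine the key range
        source.block_splits_and_merges(partition.key_range)        # 712: freeze splits and merges
        source.partition_table.persist_handoff_intent(partition)   # 712: record intent so handoff survives restarts
        for stream in partition.streams:
            destination.create_mirror_stream(stream.name)          # 714: mirror streams on the destination
        # metadata describing each stream and its extents
        return {"key_range": partition.key_range,
                "streams": [{"name": s.name, "extents": [e.id for e in s.extents]}
                            for s in partition.streams]}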

At 716, method 700 comprises transferring the partition metadata from the first storage cluster to the second storage cluster. In some examples, transferring the partition metadata may be performed via an asynchronous API call, as indicated at 718.
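
The asynchronous transfer at 716-718 might look like the following sketch; the RPC name load_partition_metadata is an assumption.

    import asyncio

    # Hedged sketch of the asynchronous metadata transfer to the destination.
    async def transfer_partition_metadata(destination, metadata):
        # the source does not block while the destination loads the metadata
        return await destination.load_partition_metadata(metadata)

    def transfer(destination, metadata):
        # fire the asynchronous call and wait for the acknowledgement
        return asyncio.run(transfer_partition_metadata(destination, metadata))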

At 720, method 700 comprises directing new transactions associated with the data partition to the second storage cluster, including while one or more extents reside at the first storage cluster. At 722, method 700 comprises, on the first storage cluster, changing an access attribute of the one or more extents within the data partition to read-only. At 724, method 700 comprises transferring the one or more extents, including underlying data within each extent, from the first storage cluster to the second storage cluster. At 726, method 700 comprises, on the second storage cluster, performing new ingress for the data partition.
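
The handoff steps at 720-726 could be ordered as in the sketch below, with the extent data copy proceeding in the background while new ingress begins on the destination; all method names are illustrative assumptions.

    # Hedged sketch of steps 720-726 of method 700.
    def hand_off_partition(partition, source, destination):
        source.partition_table.add_redirect(partition.key_range, destination)  # 720: new transactions go to the destination
        for stream in partition.streams:
            for extent in stream.extents:
                extent.set_access("read-only")                                  # 722: seal the source extents
        for stream in partition.streams:
            for extent in stream.extents:
                destination.schedule_extent_copy(extent, source)                # 724: copy extent data (may run in the background)
        destination.begin_new_ingress(partition)                                # 726: new writes land in new extents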

At 728, method 700 comprises determining whether the data storage account being migrated includes another data partition to transfer. If there are additional data partitions within the data storage account being migrated, then method 700 comprises, at 730, repeating method steps 702-726 for each data partition of the data storage account. It will be understood that multiple data partitions of the data storage account may be transferred concurrently, in various examples. When the data storage account being migrated includes no additional data partitions to transfer, then method 700 comprises, at 732, updating DNS server information for the data storage account (e.g. to direct user traffic to the second storage cluster).
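
Putting the preceding sketches together at the account level (728-732), with the caveat that partitions may also be handed off concurrently rather than sequentially, and that the DNS service and endpoint names are hypothetical:

    # Hedged, account-level sketch of 728-732, reusing the hypothetical helpers above.
    def migrate_account_partitions(account, source, destination, dns):
        for partition in account.partitions:                               # 728/730: repeat for each partition
            metadata = prepare_partition_metadata(partition, source, destination)
            transfer(destination, metadata)                                # 716-718: asynchronous metadata transfer
            hand_off_partition(partition, source, destination)             # 720-726: redirect, seal, copy, new ingress
        dns.update(account.name, destination.endpoint)                     # 732: direct user traffic to the destination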

The partition handoff performed in method 700 operates at a partition boundary such that, once the source storage cluster hands off a partition to the destination storage cluster, new requests within that partition boundary land directly on the partition at the destination storage cluster, without needing to wait for migration of all partitions of a storage account to complete. Unlike migration operations that involve deep copying of all objects, the examples disclosed herein allow traffic to be switched to the destination storage cluster without the user transactions being fully caught up to a recovery point objective (RPO).

In some embodiments, the methods and processes described herein may be tied to a computing system of one or more computing devices. In particular, such methods and processes may be implemented as a computer-application program or service, an application-programming interface (API), a library, and/or other computer-program product.

FIG. 8 schematically shows a non-limiting embodiment of a computing system 800 that can enact one or more of the methods and processes described above. Computing system 800 is shown in simplified form. Computing system 800 may take the form of one or more personal computers, server computers, tablet computers, home-entertainment computers, network computing devices, gaming devices, mobile computing devices, mobile communication devices (e.g., smart phone), and/or other computing devices.

Computing system 800 includes a logic machine 802 and a storage machine 804. Computing system 800 may optionally include a display subsystem 806, input subsystem 808, communication subsystem 810, and/or other components not shown in FIG. 8.

Logic machine 802 includes one or more physical devices configured to execute instructions. For example, the logic machine may be configured to execute instructions that are part of one or more applications, services, programs, routines, libraries, objects, components, data structures, or other logical constructs. Such instructions may be implemented to perform a task, implement a data type, transform the state of one or more components, achieve a technical effect, or otherwise arrive at a desired result.

The logic machine 802 may include one or more processors configured to execute software instructions. Additionally or alternatively, the logic machine may include one or more hardware or firmware logic machines configured to execute hardware or firmware instructions. Processors of the logic machine may be single-core or multi-core, and the instructions executed thereon may be configured for sequential, parallel, and/or distributed processing. Individual components of the logic machine optionally may be distributed among two or more separate devices, which may be remotely located and/or configured for coordinated processing. Aspects of the logic machine may be virtualized and executed by remotely accessible, networked computing devices configured in a cloud-computing configuration.

Storage machine 804 includes one or more physical devices configured to hold instructions executable by the logic machine to implement the methods and processes described herein. When such methods and processes are implemented, the state of storage machine 804 may be transformed, e.g., to hold different data.

Storage machine 804 may include removable and/or built-in devices. Storage machine 804 may include optical memory (e.g., CD, DVD, HD-DVD, Blu-Ray Disc, etc.), semiconductor memory (e.g., RAM, EPROM, EEPROM, etc.), and/or magnetic memory (e.g., hard-disk drive, floppy-disk drive, tape drive, MRAM, etc.), among others. Storage machine 804 may include volatile, nonvolatile, dynamic, static, read/write, read-only, random-access, sequential-access, location-addressable, file-addressable, and/or content-addressable devices.

It will be appreciated that storage machine 804 includes one or more physical devices. However, aspects of the instructions described herein alternatively may be propagated by a communication medium (e.g., an electromagnetic signal, an optical signal, etc.) that is not held by a physical device for a finite duration.

Aspects of logic machine 802 and storage machine 804 may be integrated together into one or more hardware-logic components. Such hardware-logic components may include field-programmable gate arrays (FPGAs), program- and application-specific integrated circuits (PASIC/ASICs), program- and application-specific standard products (PSSP/ASSPs), system-on-a-chip (SOC), and complex programmable logic devices (CPLDs), for example.

The term “program” may be used to describe an aspect of computing system 800 implemented to perform a particular function. In some cases, a program may be instantiated via logic machine 802 executing instructions held by storage machine 804. It will be understood that different programs may be instantiated from the same application, service, code block, object, library, routine, API, function, etc. Likewise, the same program may be instantiated by different applications, services, code blocks, objects, routines, APIs, functions, etc. The term “program” may encompass individual or groups of executable files, data files, libraries, drivers, scripts, database records, etc.

It will be appreciated that a “service”, as used herein, is an application program executable across multiple user sessions. A service may be available to one or more system components, programs, and/or other services. In some implementations, a service may run on one or more server-computing devices.

When included, display subsystem 806 may be used to present a visual representation of data held by storage machine 804. This visual representation may take the form of a graphical user interface (GUI). As the herein described methods and processes change the data held by the storage machine, and thus transform the state of the storage machine, the state of display subsystem 806 may likewise be transformed to visually represent changes in the underlying data. Display subsystem 806 may include one or more display devices utilizing virtually any type of technology. Such display devices may be combined with logic machine 802 and/or storage machine 804 in a shared enclosure, or such display devices may be peripheral display devices.

When included, input subsystem 808 may comprise or interface with one or more user-input devices such as a keyboard, mouse, touch screen, or game controller. In some embodiments, the input subsystem may comprise or interface with selected natural user input (NUI) componentry. Such componentry may be integrated or peripheral, and the transduction and/or processing of input actions may be handled on- or off-board. Example NUI componentry may include a microphone for speech and/or voice recognition; an infrared, color, stereoscopic, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection and/or intent recognition; as well as electric-field sensing componentry for assessing brain activity.

When included, communication subsystem 810 may be configured to communicatively couple computing system 800 with one or more other computing devices. Communication subsystem 810 may include wired and/or wireless communication devices compatible with one or more different communication protocols. As non-limiting examples, the communication subsystem may be configured for communication via a wireless telephone network, or a wired or wireless local- or wide-area network. In some embodiments, the communication subsystem may allow computing system 800 to send and/or receive messages to and/or from other devices via a network such as the Internet.

Another example provides, enacted on a computing system, a method of migrating a data partition from a first storage cluster to a second storage cluster, each storage cluster being implemented via one or more server computers, the method comprising determining that the data partition meets a migration criteria for migrating from the first storage cluster to the second storage cluster, on the first storage cluster, preparing partition metadata to be transferred, the partition metadata describing one or more streams within the data partition and one or more extents within each stream, transferring the partition metadata from the first storage cluster to the second storage cluster, directing new transactions associated with the data partition to the second storage cluster, including while the one or more extents reside at the first storage cluster, on the first storage cluster, changing an access attribute of the one or more extents within the data partition to read-only, and on the second storage cluster, performing new ingress for the data partition. In such an example, the method may additionally or alternatively comprise, after changing the access attribute of the one or more extents within the data partition to read-only, transferring the one or more extents including underlying data within each extent from the first storage cluster to the second storage cluster. In such an example, determining that the data partition meets the migration criteria may additionally or alternatively comprise determining that the first storage cluster is operating at or near a threshold based upon one or more of transactions per second (TPS), CPU usage, and storage capacity. In such an example, determining that the data partition meets the migration criteria may additionally or alternatively be based upon a decommissioning of the first storage cluster. In such an example, preparing the partition metadata to be transferred may additionally or alternatively comprise quarantining a key range of the data partition, blocking splits and merges on the key range, and persisting an intention to begin a handoff of the data partition in a partition table. In such an example, preparing the partition metadata to be transferred may additionally or alternatively comprise, on the first storage cluster, creating mirror streams on the second storage cluster. In such an example, transferring the partition metadata from the first storage cluster to the second storage cluster may additionally or alternatively be performed via an asynchronous API call. In such an example, determining that the data partition meets the migration criteria for migrating from the first storage cluster to the second storage cluster may additionally or alternatively comprise determining that a data storage account comprising the data partition meets the migration criteria for migrating from the first storage cluster to the second storage cluster. In such an example, the method may additionally or alternatively comprise updating domain name system (DNS) server information for the data storage account.
In such an example, the data partition may additionally or alternatively be a first data partition of a plurality of data partitions determined to meet the migration criteria, and the method may additionally or alternatively comprise, on the first storage cluster, preparing second partition metadata to be transferred, the second partition metadata describing one or more streams within a second data partition and one or more extents within each stream of the second data partition, transferring the second partition metadata from the first storage cluster to the second storage cluster, directing new transactions associated with the second data partition to the second storage cluster, including while the one or more extents within the second data partition reside at the first storage cluster, on the first storage cluster, changing an access attribute of the one or more extents within the second data partition to read-only, and on the second storage cluster, performing new ingress for the second data partition.

Another example provides a computing system, comprising a first storage cluster and a second storage cluster, each storage cluster being implemented via one or more server computers, and memory holding instructions executable by the logic subsystem to determine that a data partition of the first storage cluster meets a migration criteria for migrating the data partition from the first storage cluster to the second storage cluster, on the first storage cluster, prepare partition metadata describing one or more streams within the data partition and one or more extents within each stream, transfer the partition metadata from the first storage cluster to the second storage cluster, direct new transactions associated with the data partition to the second storage cluster, including while the one or more extents remain on the first storage cluster, on the first storage cluster, change an access attribute of the one or more extents within the data partition to read-only, and on the second storage cluster, perform new ingress for the data partition. In such an example, the instructions may additionally or alternatively be executable to, after changing the access attribute of the one or more extents within the data partition to read-only, transfer the one or more extents including underlying data within each extent from the first storage cluster to the second storage cluster. In such an example, the instructions may additionally or alternatively be executable to determine that the data partition meets the migration criteria by determining that the first storage cluster is operating at or near a threshold based upon one or more of transactions per second (TPS), CPU usage, and storage capacity. In such an example, the instructions may additionally or alternatively be executable to determine that the data partition meets the migration criteria based upon a decommissioning of the first storage cluster. In such an example, the instructions executable to prepare the partition metadata to be transferred may additionally or alternatively be executable to quarantine a key range of the data partition, block splits and merges on the key range, and persist an intention to begin a handoff of the data partition in a partition table. In such an example, the instructions executable to prepare the partition metadata to be transferred may additionally or alternatively be executable to, on the first storage cluster, create mirror streams on the second storage cluster. In such an example, the instructions may additionally or alternatively be executable to transfer the partition metadata from the first storage cluster to the second storage cluster via an asynchronous API call. In such an example, the instructions may additionally or alternatively be executable to determine that the data partition meets the migration criteria for migrating from the first storage cluster to the second storage cluster by determining that a data storage account comprising the data partition meets the migration criteria for migrating from the first storage cluster to the second storage cluster.
In such an example, the data partition may additionally or alternatively be a first data partition of a plurality of data partitions determined to meet the migration criteria, and the instructions may additionally or alternatively be executable to, on the first storage cluster, prepare second partition metadata to be transferred, the second partition metadata describing one or more streams within a second data partition and one or more extents within each stream of the second data partition, transfer the second partition metadata from the first storage cluster to the second storage cluster, direct new transactions associated with the second data partition to the second storage cluster, including while the one or more extents within the second data partition reside at the first storage cluster, on the first storage cluster, change an access attribute of the one or more extents within the second data partition to read-only, and on the second storage cluster, perform new ingress for the second data partition.

Another example provides, enacted on a computing system, a method of migrating a data storage account from a first storage cluster to a second storage cluster, the method comprising, for each key range of a plurality of key ranges within the data storage account, transferring metadata for the key range from the first storage cluster to the second storage cluster, once metadata for all key ranges within the data storage account has been transferred from the first storage cluster to the second storage cluster, updating domain name system (DNS) server information for the data storage account, and receiving customer requests at the second storage cluster, including when a customer request involves data not yet migrated from the first storage cluster to the second storage cluster.

It will be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Likewise, the order of the above-described processes may be changed.

The subject matter of the present disclosure includes all novel and non-obvious combinations and sub-combinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.

1. Enacted on a computing system, a method of migrating a data partition from a first storage cluster to a second storage cluster, each storage cluster being implemented via one or more server computers, the method comprising: determining that the data partition meets a migration criteria for migrating from the first storage cluster to the second storage cluster; on the first storage cluster, preparing partition metadata to be transferred, the partition metadata describing one or more streams within the data partition and one or more extents within each stream; transferring the partition metadata from the first storage cluster to the second storage cluster; directing new transactions associated with the data partition to the second storage cluster, including while the one or more extents reside at the first storage cluster; on the first storage cluster, changing an access attribute of the one or more extents within the data partition to read-only; and on the second storage cluster, performing new ingress for the data partition.
2. The method of claim 1, further comprising, after changing the access attribute of the one or more extents within the data partition to read-only, transferring the one or more extents including underlying data within each extent from the first storage cluster to the second storage cluster.
3. The method of claim 1, wherein determining that the data partition meets the migration criteria comprises determining that the first storage cluster is operating at or near a threshold based upon one or more of transactions per second (TPS), CPU usage, and storage capacity.
4. The method of claim 1, wherein determining that the data partition meets the migration criteria is based upon a decommissioning of the first storage cluster.
5. The method of claim 1, wherein preparing the partition metadata to be transferred comprises: quarantining a key range of the data partition; blocking splits and merges on the key range; and persisting an intention to begin a handoff of the data partition in a partition table.
6. The method of claim 1, wherein preparing the partition metadata to be transferred comprises, on the first storage cluster, creating mirror streams on the second storage cluster.
7. The method of claim 1, wherein transferring the partition metadata from the first storage cluster to the second storage cluster is performed via an asynchronous API call.
8. The method of claim 1, wherein determining that the data partition meets the migration criteria for migrating from the first storage cluster to the second storage cluster comprises determining that a data storage account comprising the data partition meets the migration criteria for migrating from the first storage cluster to the second storage cluster.
9. The method of claim 8, further comprising updating domain name system (DNS) server information for the data storage account.
10. The method of claim 1, wherein the data partition is a first data partition of a plurality of data partitions determined to meet the migration criteria, the method further comprising: on the first storage cluster, preparing second partition metadata to be transferred, the second partition metadata describing one or more streams within a second data partition and one or more extents within each stream of the second data partition; transferring the second partition metadata from the first storage cluster to the second storage cluster; directing new transactions associated with the second data partition to the second storage cluster, including while the one or more extents within the second data partition reside at the first storage cluster; on the first storage cluster, changing an access attribute of the one or more extents within the second data partition to read-only; and on the second storage cluster, performing new ingress for the second data partition.
11. A computing system, comprising: a first storage cluster and a second storage cluster, each storage cluster being implemented via one or more server computers; and memory holding instructions executable by the logic subsystem to: determine that a data partition of the first storage cluster meets a migration criteria for migrating the data partition from the first storage cluster to the second storage cluster; on the first storage cluster, prepare partition metadata describing one or more streams within the data partition and one or more extents within each stream; transfer the partition metadata from the first storage cluster to the second storage cluster; direct new transactions associated with the data partition to the second storage cluster, including while the one or more extents remain on the first storage cluster; on the first storage cluster, change an access attribute of the one or more extents within the data partition to read-only; and on the second storage cluster, perform new ingress for the data partition.
12. The computing system of claim 11, wherein the instructions are further executable to, after changing the access attribute of the one or more extents within the data partition to read-only, transfer the one or more extents including underlying data within each extent from the first storage cluster to the second storage cluster.
13. The computing system of claim 11, wherein the instructions are executable to determine that the data partition meets the migration criteria by determining that the first storage cluster is operating at or near a threshold based upon one or more of transactions per second (TPS), CPU usage, and storage capacity.
14. The computing system of claim 11, wherein the instructions are executable to determine that the data partition meets the migration criteria based upon a decommissioning of the first storage cluster.
15. The computing system of claim 11, wherein the instructions executable to prepare the partition metadata to be transferred are executable to: quarantine a key range of the data partition; block splits and merges on the key range; and persist an intention to begin a handoff of the data partition in a partition table.
16. A method of migrating a data partition from a first storage cluster to a second storage cluster, each storage cluster being implemented via one or more server computers, the method comprising: transferring partition metadata from the first storage cluster to the second storage cluster, the partition metadata describing one or more streams within the data partition and one or more extents within each stream; directing new transactions associated with the data partition to the second storage cluster, including while the one or more extents reside at the first storage cluster; and on the second storage cluster, performing new ingress for the data partition.
17. The method of claim 16, further comprising: on the first storage cluster, changing an access attribute of the one or more extents within the data partition to read-only; and after changing the access attribute of the one or more extents within the data partition to read-only, transferring the one or more extents including underlying data within each extent from the first storage cluster to the second storage cluster.
18. The method of claim 17, further comprising: creating a new extent, responsive to the operation of changing the access attribute, including data directed to be written to the one or more extents associated with the changed access attribute; and linking the new extent at an end of a stream associated with the one or more extents.
19. The method of claim 16, further comprising: adding a redirection instruction to the first storage cluster for directing inquiries from the first storage cluster to the second storage cluster during the data partition migration.
20. The method of claim 16, further comprising: determining that the data partition meets a migration criterion for migrating from the first storage cluster to the second storage cluster, wherein determining that the data partition meets the migration criterion: is based on a determination that the first storage cluster is operating at or near a threshold based upon one or more of transactions per second (TPS), CPU usage, and storage capacity; is based on a determination that a data storage account including the data partition meets the migration criterion for migrating from the first storage cluster to the second storage cluster; or is based upon a decommissioning of the first storage cluster.
21. The method of claim 20, further comprising updating domain name system (DNS) server information for the data storage account.
22. The method of claim 16, further comprising: loading, by the first storage cluster, an account table for the second storage cluster; and loading, by the second storage cluster, an account table for the first storage cluster.
23. The method of claim 16, wherein preparing the partition metadata to be transferred comprises: quarantining a key range of the data partition; blocking splits and merges on the key range; and persisting an intention to begin a handoff of the data partition in a partition table.
24. The method of claim 16, wherein preparing the partition metadata to be transferred comprises, on the first storage cluster, creating mirror streams on the second storage cluster.
25. The method of claim 16, wherein transferring the partition metadata from the first storage cluster to the second storage cluster is performed via an asynchronous API call.
26. A computing system, comprising: a first storage cluster and a second storage cluster, each storage cluster being implemented via one or more server computers; and processor-executable instructions stored in memory and executable by a logic subsystem to: transfer partition metadata from the first storage cluster to the second storage cluster, the partition metadata describing one or more streams within a data partition and one or more extents within each stream; direct new transactions associated with the data partition to the second storage cluster, including while the one or more extents remain on the first storage cluster; and on the second storage cluster, perform new ingress for the data partition.
27. The computing system of claim 26, wherein the processor-executable instructions are further executable to: on the first storage cluster, change an access attribute of the one or more extents within the data partition to read-only; and after changing the access attribute of the one or more extents within the data partition to read-only, transfer the one or more extents including underlying data within each extent from the first storage cluster to the second storage cluster.
28. The computing system of claim 27, wherein the processor-executable instructions are further executable to: create a new extent, responsive to the change of the access attribute, including data directed to be written to the one or more extents associated with the changed read-only access attribute; and link the new extent at an end of a stream associated with the one or more extents.
29. The computing system of claim 26, wherein the processor-executable instructions are further executable to: add a redirection instruction to the first storage cluster for directing inquiries from the first storage cluster to the second storage cluster responsive to determining that the data partition satisfies a migration criterion for migrating from the first storage cluster to the second storage cluster.
30. The computing system of claim 26, wherein the processor-executable instructions are further executable to: determine that a data partition of the first storage cluster meets a migration criterion for migrating the data partition from the first storage cluster to the second storage cluster, wherein the instructions are executable to determine that the data partition meets the migration criterion: by determining that the first storage cluster is operating at or near a threshold based upon one or more of transactions per second (TPS), CPU usage, and storage capacity; by determining that a data storage account comprising the data partition meets the migration criterion for migrating from the first storage cluster to the second storage cluster; or based upon a decommissioning of the first storage cluster.
31. The computing system of claim 26, wherein the processor-executable instructions are further executable to: quarantine a key range of the data partition; block splits and merges on the key range; and persist an intention to begin a handoff of the data partition in a partition table.
32. The computing system of claim 26, wherein the processor-executable instructions are further executable to create mirror streams of the first storage cluster on the second storage cluster.
33. The computing system of claim 26, wherein the processor-executable instructions are further executable to transfer the partition metadata from the first storage cluster to the second storage cluster via an asynchronous API call.
34. The computing system of claim 26, wherein the data partition is a first data partition of a plurality of data partitions determined to meet a migration criterion, and wherein the processor-executable instructions are further executable to: transfer second partition metadata from the first storage cluster to the second storage cluster, the second partition metadata describing one or more streams within a second data partition and one or more extents within each stream of the second data partition; direct new transactions associated with the second data partition to the second storage cluster, including while the one or more extents within the second data partition reside at the first storage cluster; on the first storage cluster, change an access attribute of the one or more extents within the second data partition to be read-only; and on the second storage cluster, perform new ingress for the second data partition.
35. One or more tangible processor-readable storage media embodied with instructions for executing on one or more processors and circuits a process for migrating a data storage account from a first storage cluster to a second storage cluster, the process comprising: for each key range of a plurality of key ranges within the data storage account, transferring metadata for the key range from the first storage cluster to the second storage cluster; once metadata for all key ranges within the data storage account has been transferred from the first storage cluster to the second storage cluster, updating domain name system (DNS) server information for the data storage account; and receiving customer requests at the second storage cluster, including when a customer request involves data not yet migrated from the first storage cluster to the second storage cluster.